Abstract

In spite of much research effort, there is no universally applicable software reliability 
growth model which can be trusted to give accurate predictions of reliability in all 
circumstances. Worse, we are not even in a position to be able to decide a priori which 
of the many models is most suitable in a particular context. Our own recent work has
tried to resolve this problem by developing techniques whereby, for each program, the
accuracy of various models can be analysed. A user is thus enabled to select that model 
which is giving the most accurate reliability predictions for the particular program under 
examination. One of these ways of analysing predictive accuracy, which we call the
u-plot, in fact allows a user to estimate the relationship between the predicted reliability
and the true reliability. In this paper we show how this can be used to improve
reliability predictions in a completely general way by a process of recalibration. 
Simulation results show that the technique gives improved reliability predictions in a 
large proportion of cases. However, a user does not need to trust the efficacy of 
recalibration, since the new reliability estimates produced by the technique are truly 
predictive and so their accuracy in a particular application can be judged using the earlier 
methods. The generality of this approach would therefore suggest that it be applied as a 
matter of course whenever a software reliability model is used. 

1 Introduction


The earliest attempts to measure and predict the reliability of software occurred about 
twenty years ago. In spite of considerable research work in the intervening years, there 
is still no definitive method or model which can be universally recommended as 'best'.
Perhaps this should not be surprising. Estimating and predicting software reliability is 
not easy. Perhaps the major difficulty is that we are concerned primarily with design 
faults. 

This situation is very different from that tackled by the conventional hardware reliability 
theory. Here the dramatic advances of the past quarter century have come from a 
concentration on the random processes of physical failure. Thus, for example, we now 
have a good understanding of how the reliabilities of complex hardware systems
depend upon, on the one hand, the detailed system structure and, on the other, the
reliabilities of the constituent components. The very success of this physical hardware
reliability theory, however, is now revealing the importance of design faults to the 
overall reliability of complex systems. Our ability to use intelligent strategies to 
minimise the effects of physical failure of components results in a higher proportion of 
system failures being caused by flawed designs. Such flaws in hardware systems are 
very similar to software faults: they represent the result of human misunderstandings. 
It seems likely, as a result of this, that obtaining good methods for measuring the effect 
of such flaws on hardware system reliability will be as difficult as measuring software 
reliability. 

Software has no significant physical manifestation. Software failures are merely 
inherent design faults revealing themselves under appropriate operational 
circumstances. These faults will have been resident in the software since their creation 
in the original design or in subsequent changes. We currently do not have good 
theories of how software faults come into being. Presumably such theories would 
require better understanding of human problem solving and the social processes 
involved in writing software; if so, we should perhaps look to social and psychological 
sciences, rather than physics, for solutions. In view of the comparative lack of success 
of these sciences in arriving at quantitative understanding, it would be wise not to 
expect any dramatic breakthrough in the short term. 

These difficulties notwithstanding, there have been important advances in software 
reliability modelling recently. In fact, there is now a plethora of models from which the 
user can choose in order to make reliability estimates and predictions. However, none 
of these has been shown to be applicable in all circumstances, and we are not presently
able to decide in a particular context which would be the most appropriate model to use. 
This presents difficulties for a potential user, who is solely interested in obtaining 
reliability measures in which he/she can have confidence. 

Our own recent work [1] has attempted to tackle this problem by devising means 
whereby judgements can be made about the accuracy of past predictions on a particular 
data source. The intention is that a user could apply such techniques, for each data 
source (program), to the results produced by several models and select the model which 
has so far performed best by giving the most accurate reliability predictions. It would 
then be sensible, in the absence of any other information, to use that model for the next 
prediction on that data source. This 'horses for courses' approach obviates the need for
a priori selection of a model: instead, each data source is provided with its 'best' model.
Indeed, this 'best' model may change as more data is collected. 

These new methods of model selection work by analysing the closeness between 
predicted and actual failure behaviour. In particular, they provide information about 
two especially important types of departure which we call bias (or ill-calibration) and 
noise (or variability). The key idea in the present work is that this knowledge of the 
nature of past errors of prediction can be used to improve future predictions. The 
techniques to be described here are quite general and are not model-dependent. They 
will be shown to be effective in improving predictive accuracy in a high proportion of 
cases, but users need not take this efficacy on trust: their predictive accuracy in a 
particular case can be analysed, just like any other model, using our earlier techniques 
[1].


2 Reliability growth and predictive accuracy 


In its simplest form, the software reliability growth problem concerns the random 
variables $T_1, T_2, \ldots, T_n$, representing the execution times between successive failures
as a program is being debugged. It is generally assumed that attempts are made at each
failure to fix the fault which caused that failure. Models vary in the way that they
represent this fault-finding and fixing operation: details of different approaches can be
found elsewhere [1, 8, 15].

At stage $i$, when observations $t_1, t_2, \ldots, t_{i-1}$ have been made of the first $i-1$ inter-failure
times, the objective is to predict future failure behaviour represented by the unobserved
random variables $T_i, T_{i+1}, \ldots$. Informally, the prediction problem is solved if we can
accurately estimate the joint distribution of any finite subset of $T_i, T_{i+1}, \ldots$. This
statement, however, begs the question of what we mean by 'accurately', and it is this 
issue which forms a major part of our earlier work [1]. 

In practice, of course, a user will be satisfied with much less than a complete 
description of all future uncertainty. In many cases, for example, it will be sufficient to 
know the current reliability of the software under examination. This could be presented 
in many different forms: the reliability function, $P(T_i > t)$; the current rate of occurrence
of failures (ROCOF) [3]; the mean (or median) time to next failure (mttf).
Alternatively, a user may wish to predict when a target reliability, perhaps to be used as 
the criterion for termination of testing, will be achieved. 

If we accept that prediction is our goal, it can be seen that the usual discussion of 
competing software reliability growth models is misleading. We should, instead, be 
comparing the relative merits of prediction systems. A prediction system which will 
allow us to predict the future ($T_i, T_{i+1}, \ldots$) from the past ($t_1, t_2, \ldots, t_{i-1}$) comprises:

(i) the probabilistic model which specifies the distribution of any subset of the $T_i$'s
conditional on an (unknown) parameter $\alpha$;

(ii) a statistical inference procedure for $\alpha$ involving use of available data
(realisations of $T_i$'s);

(iii) a prediction procedure combining (i) and (ii) to allow us to make probability
statements about future $T_i$'s.

Of course, the model is an important part of this triad and it seems unlikely that good 
predictions can be obtained if the model is not 'close to reality'. However, a good 
model is not sufficient: stages (ii) and (iii) are vital components of the prediction 
system. In fact disaster can strike at any of the three stages. 
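
As an illustration of the triad, the following Python sketch assembles a complete prediction system from a deliberately naive stand-in model: inter-failure times are taken to be exponential with a constant rate, so there is no reliability growth at all. The model, the function names `fit_rate` and `predictive_cdf`, and the data are assumptions made purely for illustration; none of them is one of the models discussed in this paper.

```python
import math

def fit_rate(times):
    """Stage (ii): maximum likelihood estimate of the rate of a
    constant-rate exponential model (a hypothetical stand-in model)."""
    return len(times) / sum(times)

def predictive_cdf(times):
    """Stages (i) and (iii): plug the estimate into the model to obtain
    a predictor of the distribution function of the next time to failure."""
    rate = fit_rate(times)
    return lambda t: 1.0 - math.exp(-rate * t)

past = [2.0, 4.0, 6.0]                   # hypothetical observed inter-failure times
F_hat = predictive_cdf(past)             # prediction for the next inter-failure time
median = math.log(2.0) / fit_rate(past)  # predicted median time to next failure
```

Even in so small a system the three stages are visibly distinct: a different estimator in `fit_rate`, or a Bayesian predictive distribution in place of the plug-in rule, would give different predictions from the same model.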

In principle, it ought to be possible to analyse each of the three stages separately so as 
to gain trust in (or to mistrust) the predictions. Unfortunately, it is our experience that 
this is not possible. There are several reasons. 

In the first place, the models are usually too complicated for a traditional 'goodness-of-fit'
approach to be attempted. Even the simplest exponential order statistic model [14]
does not allow this kind of analysis. This should not surprise us: the goodness-of-fit 
problem for independent identically distributed random variables is hard in the presence 
of unknown parameters. The reliability growth context is much worse because of non- 
stationarity. 

Secondly, statistical properties of the estimators of unknown parameters for a non- 
Bayesian analysis of these models are usually not available. For example, several 
models assume that the software contains only a finite number of faults. There is thus 
an upper bound on the number of observable $T_i$'s. This implies that we cannot even
trust the usual asymptotic theory for maximum likelihood (ML) estimators. Their small 
sample properties are invariably impossibly hard to obtain. 

Of course, there is a proper approach to stages (ii) and (iii) in the Bayesian framework. 
It involves posterior distributions of the parameters at stage (ii) and Bayesian predictive 
distributions for (iii) (see [2]). Unfortunately, this does present some analytical 
difficulties for the popular software reliability growth models. However, with recent 
advances in Bayesian numerical techniques [18], coupled with powerful personal 
computers, this picture may change in the near future. 

Finally, it could be argued that there are models which are 'obviously' better than others 
because of the greater plausibility of their underlying assumptions. We find this a 
dubious proposition. Certainly, the assumptions of some models seem overly naive 
and it might be reasonable to discount them. However, this still leaves others which 
cannot be rejected a priori. It is our belief that understanding of the processes of 
software engineering is so imperfect that we cannot even choose an appropriate model 
when we have an intimate knowledge of the software under study. At some future time 
it may be possible to match a reliability model to a program via the characteristics of that 
program, or even of the software development methodology used. This is not currently 
the case. 

Where does this leave a user, who merely wants to obtain trustworthy reliability metrics 
for his current software project? Our view is that there is no alternative to a direct 
examination and comparison of the quality of the predictions emanating from different 
complete prediction systems. In [1] we have described several ways in which this can 
be done, the most important tools being the u-plot and the prequential likelihood. The 
key idea in each case is that a comparison is made between what has been predicted and 
what is (later) actually observed. We believe that this emulates how a user would
informally gain confidence in a sequence of predictions. 

For simplicity we shall concentrate on prediction of the next time to failure $T_i$, based on
observations $t_1, t_2, \ldots, t_{i-1}$. The u-plot uses the predictor $\hat{F}_i(t)$, the estimate of the
distribution function $F_i(t) = P(T_i < t)$, via

$$u_i = \hat{F}_i(t_i) \qquad (1)$$

where $t_i$ is the later-observed realisation of the random variable $T_i$. Thus $u_i$ is the
probability integral transform of the observation using the predictive distribution
function. If the sequence of predictions $\{\hat{F}_i(t_i)\}$ is good, it is easy to see that the
sequence $\{u_i\}$ should look like a random sample from a $U(0,1)$ distribution [1]. There
are various types of departure from such an appearance which might show themselves;
here we shall only be concerned with whether the $\{u_i\}$ sequence looks uniformly
distributed. We shall do this via the u-plot, which is the sample cumulative distribution
function (cdf) of the $u_i$ sequence. The departure of this plot from the cdf of $U(0,1)$, the
line of unit slope, is an indication of a departure of the prediction system from accuracy.
We can use the Kolmogorov distance, that is the maximum vertical deviation, as a 
measure of this departure and use standard tables to determine whether or not it is 
statistically significant. 
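
The calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results reported here: the function names are our own, and the sample cdf is assumed to rise by 1/n at each of the n sorted u values.

```python
def u_values(predictive_cdfs, observed):
    """Evaluate each predictive distribution function at the
    later-observed inter-failure time it was predicting."""
    return [F(t) for F, t in zip(predictive_cdfs, observed)]

def kolmogorov_distance(us):
    """Maximum vertical deviation between the sample cdf of the u's
    and the line of unit slope (the U(0,1) cdf)."""
    s = sorted(us)
    n = len(s)
    # The sample cdf jumps from k/n to (k+1)/n at the (k+1)-th sorted u,
    # so the largest deviation occurs just before or just after a jump.
    return max(max((k + 1) / n - u, u - k / n) for k, u in enumerate(s))
```

For example, four u values bunched near zero, `[0.1, 0.2, 0.3, 0.4]`, give a distance of 0.6, whereas the evenly spread values `[0.125, 0.375, 0.625, 0.875]` give 0.125, the smallest value possible with four points.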

Figure 1 shows u-plots for Jelinski-Moranda [10] and Littlewood-Verrall [13] models
making predictions on a data set, called S1 [17], analysed in [1]. These plots are each
based on 86 predictions: $\hat{F}_{51}(t)$ through $\hat{F}_{136}(t)$. The Kolmogorov distances are 0.205
(JM) and 0.150 (LV). The first is significant at the 1% level, suggesting very poor 
prediction from JM; the second is significant at 5%, which suggests that this model is 
also performing poorly but is somewhat superior to JM. 

More importantly for our present purposes, the shape of the plots tells us that JM is 
making predictions which are too optimistic, whilst LV predictions are too pessimistic. 
This can be seen as follows. The JM plot is everywhere above the line of unit slope 
(the true $U(0,1)$ cdf), so there are too many small $u_i$ values. But consistently too small
$u$ values tell us that the model is underestimating the chance of small times between
failure, i.e. the model is too optimistic. A similar argument shows that a plot which is 
almost everywhere below the line of unit slope, such as LV, is too pessimistic. 
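
The direction of the departure can be read off mechanically by comparing the largest deviation above the line of unit slope with the largest deviation below it. The Python sketch below is a crude heuristic of our own devising, offered only to make the argument concrete; the function name and the tie-breaking rule are assumptions, and a plot that crosses the line needs the fuller inspection of its shape.

```python
def bias_direction(us):
    """Classify a u-plot by the side of the unit-slope line on which
    its largest deviation lies (a first-approximation heuristic)."""
    s = sorted(us)
    n = len(s)
    above = max((k + 1) / n - u for k, u in enumerate(s))  # sample cdf above the line
    below = max(u - k / n for k, u in enumerate(s))        # sample cdf below the line
    if above > below:
        return "optimistic"    # too many small u values
    if below > above:
        return "pessimistic"   # too many large u values
    return "no clear direction"
```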

If we knew that these deviations between predicted and actual behaviour were
consistent, we could attempt to measure the degree of optimism (or pessimism) and 
improve future predictions by taking account of this tendency. It is this idea which we 
shall develop in the next section. Before we do that, we shall briefly describe the 
prequential likelihood function (PL) which is a general mechanism for comparing the 
accuracy of prediction systems. 

The PL is defined as follows. The predictive distribution $\hat{F}_i(t)$ for $T_i$ based on
$t_1, t_2, \ldots, t_{i-1}$ will be assumed to have a probability density function (pdf)

$$\hat{f}_i(t) = \hat{F}_i'(t)$$

For predictions of $T_{j+1}, T_{j+2}, \ldots, T_{j+n}$, the prequential likelihood is

$$PL_n = \prod_{i=j+1}^{j+n} \hat{f}_i(t_i) \qquad (2)$$

A comparison of two prediction systems, A and B, over a range of predictions of
$T_{j+1}, T_{j+2}, \ldots, T_{j+n}$, can be made via their prequential likelihood ratio

$$PLR_n = \frac{\prod_{i=j+1}^{j+n} \hat{f}_i^A(t_i)}{\prod_{i=j+1}^{j+n} \hat{f}_i^B(t_i)} \qquad (3)$$

Notice how, in a fashion analogous to the calculation of the u sequence, the individual 
contributions to the prequential likelihood are obtained by substitution into the predictor 
pdf for $T_i$ of the later-observed realisation $t_i$. Dawid [7] shows that if $PLR_n \to \infty$
as $n \to \infty$, prediction system B is discredited in favour of A. For the finite samples
with which we inevitably have to deal, we shall argue that $PLR_n$ increasing consistently
suggests the superiority of A over B. In [1] we give intuitive reasons why the PL 
works. Specifically we show that consistent bias or noisiness of a prediction system 
will tend to give a smaller PL than would otherwise be the case. 
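
A minimal Python sketch of the prequential likelihood ratio follows. Working with logarithms rather than with the raw product is our own choice, made to avoid numerical underflow over long sequences of predictions; the function names and the exponential example densities are likewise assumptions for illustration.

```python
import math

def log_plr(pdfs_a, pdfs_b, observed):
    """Logarithm of the prequential likelihood ratio of system A to
    system B: each predictive pdf is evaluated at the inter-failure
    time that was observed after the prediction was made."""
    return sum(math.log(fa(t)) - math.log(fb(t))
               for fa, fb, t in zip(pdfs_a, pdfs_b, observed))

def exp_pdf(rate):
    """Hypothetical predictive pdf: exponential with the given rate."""
    return lambda t: rate * math.exp(-rate * t)

observed = [1.0, 2.0, 3.0]          # hypothetical later-observed times
system_a = [exp_pdf(0.5)] * 3       # predicts a mean time to failure of 2
system_b = [exp_pdf(2.0)] * 3       # predicts a mean time to failure of 0.5
evidence = log_plr(system_a, system_b, observed)
```

Here `evidence` is positive, so the observed times favour system A; a value that keeps growing as further predictions are added is the finite-sample counterpart of the divergence condition.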

To summarise, the PLR can be regarded as a general procedure for choosing the best 
prediction system for a particular data source. The u-plot is a means of indicating a 
particular kind of consistent inaccuracy of prediction which could be a contributory
factor in poor predictive accuracy. Thus a poor u-plot might suggest that poor 
predictive accuracy (represented by a poor prequential likelihood) is due to consistent 
bias. For such a case, we shall show in the next section how it is possible to remove 
the bias and so improve the accuracy of reliability predictions. 


3 Recalibration of predictions 


Consider a prediction $\hat{F}_i(t)$ of the random variable $T_i$, when the true (unknown)
distribution is $F_i(t)$. Let the relationship between these be represented by the function
$G_i$ where

$$F_i(t) = G_i[\hat{F}_i(t)] \qquad (4)$$

Obviously, if we knew $G_i$ we could recover the true distribution of $T_i$ from the
inaccurate predictor, $\hat{F}_i(t)$. The key notion in our recalibration approach is that in
many cases the sequence $\{G_i\}$ is approximately stationary, i.e. it is only slowly
changing in $i$.

If the sequence were completely stationary, i.e. $G_i = G$ for all $i$, we would have a more
precise interpretation of the idea of 'consistent bias' used in the previous section. We 
would also have the possibility of estimating the common G from past predictions and 
using it to improve the accuracy of future predictions. 

Of course, in practice such complete stationarity is unlikely to be achieved. However, it 
does seem to be the case that the sequence changes only slowly in many cases. This 
opens up the possibility of approximating $G_i$ with an estimate $G_i^*$ and so forming a
new prediction

$$\hat{F}_i^*(t) = G_i^*[\hat{F}_i(t)]. \qquad (5)$$

A suitable estimator for $G_i$ is suggested by the observation that $G_i$ is the distribution
function of $U_i = \hat{F}_i(T_i)$. We shall therefore base our estimate $G_i^*$ on the u-plot,
calculated from predictions which have been made prior to $T_i$, which is the sample cdf
formed from the $u_j$'s for $j < i$. The new prediction (5) recalibrates the raw model output,
$\hat{F}_i(t)$, in the light of our knowledge of the accuracy of past predictions for the data
source under study. The new procedure is therefore a truly predictive one, 'learning'
from past errors.

The simplest form for $G_i^*$ is the u-plot with steps joined up to form a polygon (Figure
2). Later we shall consider a version which is smoothed using a spline technique. The
complete procedure for forming a recalibrated prediction for the next time to failure, $T_i$,
is then:

Stage 1  Check that error in previous predictions is approximately stationary.
         (See [1] for a plotting technique, the y-plot, which detects
         non-stationarity, although we shall see later that recalibration often works
         well even in the presence of non-stationarity.)

Stage 2  Find the u-plot for predictions made before $T_i$, i.e. based on $t_1, t_2, \ldots, t_{i-1}$,
         and join up the steps to form a polygon, $G_i^*$.

Stage 3  Use the basic prediction system to make a 'raw' prediction, $\hat{F}_i(t)$.

Stage 4  Recalibrate the raw prediction using (5).
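
Stages 2 to 4 can be sketched in Python as follows. The text does not fix the exact vertices of the polygon, so the sketch assumes one plausible construction: with n past u values, the polygon joins (0, 0), the points (u_(k), k/(n+1)) at the sorted u values, and (1, 1). That choice, and the names `polygon_G` and `recalibrate`, are assumptions for illustration only.

```python
import bisect

def polygon_G(past_us):
    """Stage 2: joined-up u-plot as a piecewise-linear function on [0, 1].
    Assumed vertices: (0, 0), (u_(k), k/(n+1)) for k = 1..n, and (1, 1)."""
    xs = [0.0] + sorted(past_us) + [1.0]
    n = len(past_us)
    ys = [k / (n + 1) for k in range(n + 2)]

    def G(u):
        if u <= 0.0:
            return 0.0
        if u >= 1.0:
            return 1.0
        j = bisect.bisect_right(xs, u) - 1       # segment with xs[j] <= u < xs[j+1]
        x0, x1 = xs[j], xs[j + 1]
        return ys[j] + (ys[j + 1] - ys[j]) * (u - x0) / (x1 - x0)

    return G

def recalibrate(raw_cdf, past_us):
    """Stages 3 and 4: compose the polygon with the raw predictor."""
    G = polygon_G(past_us)
    return lambda t: G(raw_cdf(t))
```

With past u values `[0.1, 0.2, 0.3]`, which indicate an optimistic raw predictor, a raw probability of 0.2 of failure before some time t is recalibrated to 0.5: the recalibrated prediction is less optimistic, as intended.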

This whole procedure can be repeated at each stage so that the functions $G_i^*$ used for
recalibration will be based on more information about past errors as i increases. For the 
simple joined-up u-plot this is not computationally onerous: by far the greatest 
computational effort is needed for the statistical inference procedures used to obtain the 
raw model predictions. 

It is important to emphasise that the procedure described above does in fact produce a 
genuine prediction system in the sense described earlier: at each stage we are using only 
past observations to make predictions about the unobserved future failure behaviour. 

Figure 3 shows the effect of recalibration on the predictions made in Figure 1. In the 
case of the JM model it is known that the raw predictions are too optimistic, and the 
recalibration makes them less optimistic; in the case of LV, which is initially too 
pessimistic, the recalibrated version is now less pessimistic. These conclusions are 
confirmed in the more formal analysis based on the u-plot technique: for JM* the 
Kolmogorov distance of the u*-plot is 0.119 (compared with 0.205 for the raw 
predictions), for LV* it is 0.089 (compared with 0.150). Not only are these an 
improvement in each case, the distances are now no longer statistically significant at the
10% level. 

Notice that, although Figure 3, for simplicity, only shows median predictions, the 
recalibration is working on the complete predictive distribution. Thus it could be 
expected to improve other reliability estimates, such as the rate of occurrence of failures,
in the examples shown here. The recalibration procedure changes the complete shape 
of the distribution and can therefore correct for far more subtle errors than the mainly 
simple 'optimism' or 'pessimism' of these examples. 

Figure 4 shows an analysis of a data set, SS3 from [17], which exhibits startling 
disagreement between raw predictions from JM and LV models. In fact, in an analysis 
of this data using nine models [5], it can be seen that seven of them are in close 
agreement with one another and are close to the JM plot in Figure 4; the remaining two 
are close to the LV plot in Figure 4. A user might conclude that the seven models 
which give similar answers are closer to the truth than the more isolated pair, but this 
would be wrong. In fact for this data set none is giving acceptable answers. This is 
shown by the u-plots for JM and LV predictions in Figure 5. Clearly, the JM
predictions are optimistic, and those from LV pessimistic. The effect is a gross one, as 
can be seen from the Kolmogorov distances, 0.272 (JM) and 0.238 (LV), which are 
very highly significant (well beyond the 1% level, the highest tabulated). The 
prequential likelihood shows that LV is superior to JM [1], but neither of them, nor any 
other model we have used, gives accurate reliability predictions for this data source. 

The detailed shape of the u-plots in Figure 5 is interesting. As was stated above, the 
most notable feature is the extreme optimism or pessimism. However, this is not a 
simple effect in either case. For JM the behaviour of the plot at each extremity suggests 
too many very small u values and too many very large ones. For LV there seem to be 
too many fairly large u's and too few u's near to 1.0. Thus, although the statements
above about optimism and pessimism are correct to a first approximation, a more
detailed analysis shows that the u-plots are giving precise information about the 
incorrect shapes of the complete predictive distributions. It can therefore be seen how 
the recalibration procedure based on such u-plots can effect subtle changes in the 
complete estimated distribution function for the random variable $T_i$.

The recalibration technique works dramatically well for this data. Table 1 shows a 
comparison between raw model predictions and recalibrated predictions for the 
following nine models: JM (Jelinski-Moranda, [10]), BJM (Bayesian Jelinski-Moranda,
[11]), GO (Goel-Okumoto, [9]), MO (Musa-Okumoto, [16]), D (Duane,
[6]), L (Littlewood, [12]), LNHPP (Littlewood non-homogeneous Poisson process,
[1]), LV (Littlewood-Verrall, [13]), and KL (Keiller-Littlewood, [1]).

All nine raw u-plots have Kolmogorov distances which are significant well beyond the 
tabulated 1%. After recalibration, all the distances have been more than halved and 
none are significant at this high level. Figure 6 shows the dramatic improvement given 
by recalibration on the JM and LV u-plots in comparison with the raw predictions (see 
Figure 5). The differences in the detailed median predictions (only for JM and LV 
again, for simplicity) can be seen by comparing Figures 4 and 7. There is much closer
agreement between the recalibrated models than between the raw ones. 

In both the above examples there is evidence that prediction systems which were in 
disagreement have been brought into closer agreement by the recalibration technique. 
Much more important, however, we have objective evidence from the comparison of
u-plot with u*-plot that recalibrated predictions are less 'biased' than the raw ones.

These results are encouraging for the efficacy of the recalibration approach, but they are 
not sufficient grounds for assuming, even in the two examples here, that the 
recalibrated predictions should be preferred to the raw ones. It may be that the 
advantage of less bias has been bought at the expense of some other deviation between 
predicted and actual reliability. We have suggested in the previous section that the 
prequential likelihood should be used as arbiter between competing prediction systems 
for any particular data source. It would seem appropriate, therefore, to judge whether a 
raw or recalibrated prediction system is objectively best by comparing their prequential 
likelihoods for a series of predictions. Unfortunately this presents problems for 
recalibrated predictions which are based on the simple polygonal joined-up u-plots 
suggested above. The reason is somewhat 'technical' and is due to the fact that the PL 
uses the probability density function of the predictive distribution: 

$$PL_n^* = \prod_{i=j+1}^{j+n} \hat{f}_i^*(t_i) = \prod_{i=j+1}^{j+n} g_i^*(\hat{F}_i(t_i))\,\hat{f}_i(t_i) = \prod_{i=j+1}^{j+n} g_i^*(u_i)\,\hat{f}_i(t_i) \qquad (6)$$

from (5), letting $g_i^*$ denote the derivative of $G_i^*$.

Unfortunately, since $G_i^*$ is a polygon, its derivative $g_i^*$ is discontinuous. This
means that $\hat{f}_i^*$ is also discontinuous: Figure 8 shows an example of this problem.
This discontinuity generally causes PL to report badly on the predictive accuracy of a 
recalibrated model in competition with the raw version. A user might therefore 
conclude that recalibration had made the predictions less accurate. We think this can be 
misleading. It is true that it would be unreasonable to believe that the true predictive pdf 
is grossly discontinuous; the rejection of such a pdf by the PL criterion is therefore 
strictly correct. However, in practice users are not directly interested in predictive pdfs 
but in probabilities. Such probabilities will be obtained from the pdf by integration, 
which has the effect of smoothing out the discontinuity. It is therefore perfectly 
possible for PL to reject a recalibrated prediction system in favour of the raw version, 
even when the recalibrated (probability) predictions are the most accurate. A rejection 
in such circumstances is, we believe, unfair: a user needs to know which prediction 
system is performing best for the kinds of prediction he is likely to make. 
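
The source of the discontinuity is easy to exhibit numerically. The Python sketch below lists the slope of each segment of a joined-up u-plot; by the chain rule these slopes are the factors by which the raw predictive pdf is multiplied. As the exact vertices of the polygon are not fixed in the text, the sketch assumes they are (0, 0), (u_(k), k/(n+1)) and (1, 1); the function name is our own.

```python
def polygon_slopes(past_us):
    """Return (left, right, slope) for each segment of the joined-up
    u-plot, under the assumed vertex heights k/(n+1)."""
    xs = [0.0] + sorted(past_us) + [1.0]
    step = 1.0 / (len(past_us) + 1)   # rise of the polygon across each segment
    return [(x0, x1, step / (x1 - x0))
            for x0, x1 in zip(xs, xs[1:]) if x1 > x0]

segments = polygon_slopes([0.1, 0.5])   # hypothetical past u values
```

For these two u values the slope is about 3.33 on (0, 0.1), 0.83 on (0.1, 0.5) and 0.67 on (0.5, 1), so the recalibrated pdf jumps by a factor of four as the raw predictor passes through 0.1, although the recalibrated distribution function itself remains continuous.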

There are two ways forward which will be described in the next two sections. 

The first approach attempts to decide whether recalibration can be trusted to give 
improved results in a wide class of circumstances by comparing both recalibrated and 
raw predictions with the true reliability when this is known. In practice, of course, 
such knowledge of the truth is not available so we shall have to use simulated inter- 
failure times. We shall show that in a high proportion of cases the recalibrated 
prediction system is superior to the raw one. However, as might be expected, this is 
not always the case. 

Our second approach, therefore, applies a smoothing to the polygonal $G_i^*$ in order to
give a continuous recalibrated predictive pdf. This allows the use of PL as a criterion 
for judging which prediction system is giving most accurate results. Use of this 
smoothing is computationally more intensive than use of the simple joined-up u-plot. 

A user therefore has a choice: appeal to the general efficacy of the approach as 
demonstrated by the simulation results based on the simple recalibration technique, or 
use the smoothed version and use PL to decide whether recalibration is working in the 
particular example under study. 


4 Simulation results


The simulation experiment [4] consisted of generating 100 realisations of the inter-failure
time sequence $t_1, t_2, \ldots, t_{100}$ from each of the models JM, L, LV, KL and D,
with constant parameters being used for each model. These data sets were then
analysed using the 'wrong' models: thus, for example, the JM data set was analysed
using the L, LV, KL and D models.

The model parameters were estimated based on $t_1, t_2, \ldots, t_{j-1}$, to obtain $\hat{F}_j(t)$ for
$j = 20, \ldots, 101$. Then, for $i = 40, \ldots, 101$, the u-plot using $u_j = \hat{F}_j(t_j)$, for
$j = 20, \ldots, i-1$, was used to obtain $G_i^*$ and hence $\hat{F}_i^*(t)$. It was thus possible to
compare the known true $F_i(t)$ with the raw predictor $\hat{F}_i(t)$ and with the recalibrated
predictor $\hat{F}_i^*(t)$ for $i = 40, \ldots, 101$.

In a particular case a user is interested in knowing whether the raw or recalibrated 
predicted distribution is closer to the true one. There are various ways we could 
examine the differences between predicted and true distributions. Perhaps the most 
obvious is a direct measure of the distance between the two functions, such as the 
Kolmogorov distance. This is defined as follows. For raw predictions let
$\delta_i(t) = \hat{F}_i(t) - F_i(t)$ and for recalibrated $\delta_i^*(t) = \hat{F}_i^*(t) - F_i(t)$, both for
$i = 40, \ldots, 101$. The Kolmogorov distances are $k_i = \sup_{t>0} |\delta_i(t)| = |\delta_i(x)|$ and
$k_i^* = \sup_{t>0} |\delta_i^*(t)| = |\delta_i^*(y)|$. A simpler procedure is merely to check whether the
recalibrated or raw median is closer to the true one.
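
When the true distribution is known, as in a simulation, this distance can be approximated by evaluating both distribution functions on a fine grid. The Python sketch below is an approximation under stated assumptions, not the procedure used in the experiment [4]: the grid, the cut-off `t_max`, the function name and the exponential example are all our own.

```python
import math

def sup_distance(cdf_pred, cdf_true, t_max, steps=10000):
    """Grid approximation to the largest absolute difference between a
    predicted and a true distribution function over (0, t_max]."""
    return max(abs(cdf_pred(t_max * k / steps) - cdf_true(t_max * k / steps))
               for k in range(1, steps + 1))

# Hypothetical example: the truth is exponential with rate 1, while the
# prediction is exponential with rate 2 (too pessimistic).
true_cdf = lambda t: 1.0 - math.exp(-t)
pred_cdf = lambda t: 1.0 - math.exp(-2.0 * t)
distance = sup_distance(pred_cdf, true_cdf, t_max=10.0)
```

In this example the exact supremum is 0.25, attained at t = ln 2, and the grid value agrees with it closely. A grid can miss the maximum of a polygonal predictor between grid points, so a fine grid is advisable when recalibrated predictions are being compared.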

The first analysis concerns only predictions of $T_{101}$; there are 2000 such predictions in
the experiment. If we consider those predictions of $T_{101}$ for which the u-plot (based on
predictions prior to $T_{101}$) was significant at the 5% level, indicating that there was
evidence of bias, 89% of the recalibrated predictions were superior to the corresponding 
raw ones. This figure rises to 92% if we only recalibrate for u-plots which are 
significant at the 1% level. 

Even when we recalibrated regardless of the u-plot evidence, the recalibrated 
predictions improved on raw ones in 61% of cases. Here there will be many cases 
where raw predictions are close to the truth; then we would not expect the recalibration 
to introduce an improvement and the recalibrated and raw predictions should be close to 
one another. However, since the recalibrated predictive distribution is polygonal 
('lumpy'), the Kolmogorov distance (which compares the maximum deviations of the
two predictions from the truth) will tend to discriminate against the recalibration in
favour of the raw prediction. This figure of 61% can therefore be thought of as a 
conservative one. 

Other simple comparisons between recalibrated and raw predictions are fairer in this 
situation. For example, the recalibrated median is closer than the raw one to the true 
median in 70% of these cases. This figure rises to 91% when we recalibrate only for u- 
plots significant at 5%, and 94% when we recalibrate only for u-plots significant at 1%. 

These results for $T_{101}$ are supported by the more extensive recalibrations of the
predictions of $T_{40}, \ldots, T_{101}$: here recalibrated medians are closer to the true one in 86%
of cases when the u-plot at stage 100 was 5% significant, and are closer in 93% of
cases when the u-plot is significant at 1%.

In summary, even when we blindly used the recalibration on all predictions, there was 
an improvement in about 7 out of 10 cases. More importantly, when we adopted the 
more rational and discriminating approach of only using the technique when the u-plot 
analysis suggested recalibration might be fruitful (by indicating the presence of 'bias'),
there was improvement about 9 out of 10 times. 

Of course, we do not know whether our simulated data was typical of real software 
reliability data. Indeed, since we were generating data according to several models with 
very different underlying assumptions, some of the data sets are likely to be unrealistic. 
However, we believe that these results are encouraging for the general power of the 
approach. 

In practice a user might wish to have more than a belief in the general efficacy of the 
approach: he needs to know that it is working for the particular data source under 
examination. The obvious approach is to use the methods of analysis of predictive 
quality [1] discussed earlier. In the next section we show how this can be done. 


5 Parametric spline smoothing 


The u-plot is merely the sample cdf of the observed u's. Thus the problem of
estimating the approximately stationary function G_i in (4) is simply the problem of
obtaining an estimate of a cdf from a finite random sample. There are several ways in
which this can be done so that the estimator is differentiable and so has a smooth pdf. 
We could, for example, fit an appropriate parametric family of distributions to the data. 
An example is the family of Beta(\alpha, \beta) distributions with pdf

f(u) = u^{\alpha-1} (1-u)^{\beta-1} / B(\alpha, \beta),    0 < u < 1        (7)

This is a fairly flexible family, but it is not sufficiently wide to represent all the general 
shapes of u-plots which we have encountered in practice (see [5] for an example). This 
seems likely to be a problem with other candidate parametric families of distributions. 
Another, less important, difficulty is that the evaluation of the cdf is not easy for certain 
regions of the parameter space. 

The need for a method of fitting a very general class of u-plot data suggests the use of 
parametric splines, which are widely used in computer graphics because of their 
versatility. We shall use the cumulative chord as the parameter, whereupon the spline is 
defined as follows. Let {x_i, y_i}, for i = 1, 2, ..., r, denote the r points of the u-plot
to which we want to fit the spline, and let

p_i' = p_{i-1}' + [(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2]^{1/2}        (8)

with p_0' = 0, x_0 = 0 and y_0 = 0; i.e. p_i' is the distance from the origin, along the
polygon, to the ith point. Here x_i is the ith order statistic of the u's and y_i is the height
of the u-plot at x_i. For convenience we shall use the normalised chord

p_i = p_i' / p_r'        (9)

so that both parametric functions will have domain [0,1]. 
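
The chord construction in (8) and (9) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `normalised_chord` is hypothetical, and the u-plot heights are assumed to be y_i = i/(r+1), following the step size stated in the caption of Figure 2.

```python
import math


def normalised_chord(us):
    """Return (xs, ys, ps) for a sample of u values.

    xs: order statistics of the u's.
    ys: assumed u-plot heights i/(r+1), i = 1..r (an assumption, see lead-in).
    ps: normalised cumulative chord, equations (8) and (9).
    """
    xs = sorted(us)
    r = len(xs)
    ys = [(i + 1) / (r + 1) for i in range(r)]

    chord = []
    total = 0.0
    prev_x, prev_y = 0.0, 0.0  # the polygon starts at the origin
    for x, y in zip(xs, ys):
        # Equation (8): add the length of the segment from the previous point.
        total += math.hypot(x - prev_x, y - prev_y)
        chord.append(total)
        prev_x, prev_y = x, y

    # Equation (9): divide by the total length so that the last value is 1.
    ps = [c / total for c in chord]
    return xs, ys, ps
```

The pairs (x_i, p_i) and (y_i, p_i) returned here are the two data sets to which the splines are then fitted.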

We now have two sets of data, {x_i, p_i} and {y_i, p_i}, to each of which we fit a three
knot least-squares cubic spline; call these x = x(p) and y = y(p). These splines are each
constrained so that x(p) and y(p) are strictly increasing functions taking values between
0 and 1 for p in (0, 1), with x(0) = y(0) = 0 and x(1) = y(1) = 1. It follows that the
function defined parametrically as (x(p), y(p)) is also strictly increasing between 0 and
1. We call this function the parametric spline and it has the properties of a cdf. More
importantly for our needs, it is everywhere differentiable with a smooth derivative.
This means that if we use this function to recalibrate software reliability predictions we
are certain to obtain a smooth recalibrated predictive density. We can therefore use
prequential likelihood as a criterion of predictive accuracy and be confident that we shall
not encounter the difficulties we met with the polygonal joined-up u-plot.
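
To make concrete how a curve defined parametrically can be used as a cdf, the following sketch evaluates G(u) by solving x(p) = u for p and returning y(p). It is an illustration under stated assumptions only: `parametric_cdf`, `x_demo` and `y_demo` are hypothetical names, the two demo functions are simple monotone cubics standing in for the fitted splines (the actual fitting is described in [5]), and bisection is just one convenient way of inverting x(p).

```python
def parametric_cdf(x_of_p, y_of_p, u, iterations=60):
    """Evaluate G(u) for the curve (x(p), y(p)), p in [0, 1].

    Assumes x_of_p and y_of_p are strictly increasing with
    x(0) = y(0) = 0 and x(1) = y(1) = 1.
    """
    if u <= 0.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        # x is increasing, so the root of x(p) = u lies above mid when x(mid) < u.
        if x_of_p(mid) < u:
            lo = mid
        else:
            hi = mid
    return y_of_p(0.5 * (lo + hi))


# Hypothetical stand-ins for the fitted splines, chosen only because they
# are smooth, strictly increasing on (0, 1) and hit the end points exactly.
def x_demo(p):
    return 0.5 * (p + p ** 3)


def y_demo(p):
    return 3.0 * p ** 2 - 2.0 * p ** 3
```

The derivative of such a curve with respect to x is y'(p)/x'(p), which is continuous whenever both component functions are smooth and x'(p) is non-zero.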

Clearly, using this spline is more tedious than recalibrating predictions from the joined- 
up u-plot; details can be found in [5]. However, run times are generally much less than
are required for the original raw predictions. Since these raw predictions must always
be computed, the small overhead involved in using the spline is worthwhile. Most 
importantly this technique allows a user to determine, via prequential analysis, whether 
the recalibrated predictions are objectively better than the raw ones for a particular data 
source. It also similarly allows comparisons to be made between different recalibrated 
prediction systems. Such knowledge about the performance in a particular instance is 
more valuable than the general assertions of efficacy which come from the earlier 
simulation exercise. 

To distinguish it from the earlier polygonal G*, we shall denote the spline smoothed 
recalibrating function by G**. The recalibrated predictions are then 

\hat{F}_i^{**}(t) = G_i^{**}[\hat{F}_i(t)]        (10)
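
Since the spline-smoothed recalibrating function is differentiable, the recalibrated predictive density follows from (10) by the chain rule. In the sketch below, \hat{f}_i denotes the raw predictive density and g_i^{**} the derivative of G_i^{**}; both symbols are introduced here for illustration and are not taken from the source.

```latex
\hat{f}_i^{**}(t)
  = \frac{d}{dt}\, G_i^{**}\bigl[\hat{F}_i(t)\bigr]
  = g_i^{**}\bigl[\hat{F}_i(t)\bigr]\, \hat{f}_i(t)
```

A smooth g_i^{**} therefore gives a smooth recalibrated density whenever the raw density is itself smooth.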

Table 2 shows the u-plot and y-plot Kolmogorov distances for the same data sets as 
those used in Table 1. It can be seen that the entries in the two tables are very similar. 
This is to be expected since the spline recalibrated predictive distribution function is 
designed to be a smooth function close to the joined-up recalibrated predictive 
distribution. If these two functions are close, the u's based on them will be close and 
thus so will the plots. In practical terms this means that the predictions of probabilities 
from the two techniques will be very similar, and in particular their medians are very 
close (compare Figure 9 with Figure 7). However, their predictions of probability 
densities will be very different: it is this difference we wish to exploit in the use of the 
prequential likelihood for the spline version. 
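
For readers who want to reproduce distances of the kind tabulated, a u-plot Kolmogorov distance can be computed roughly as follows. This is a sketch under assumptions: the function name is hypothetical, and the u-plot is taken to be the ordinary step-function sample cdf with steps of size 1/n, compared against the line of unit slope.

```python
def u_plot_kolmogorov_distance(us):
    """Maximum vertical deviation of the sample cdf of the u's from the line y = u."""
    xs = sorted(us)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one u value")
    distance = 0.0
    for i, u in enumerate(xs):
        # The step cdf jumps from i/n to (i+1)/n at u, so the deviation
        # from the unit-slope line is checked on both sides of the jump.
        distance = max(distance, abs((i + 1) / n - u), abs(u - i / n))
    return distance
```

The significance letters attached to the table entries would then come from comparing such a distance with tabulated critical values for the given number of predictions.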

In Figure 10 the evolution of the prequential likelihood ratios is shown for the various 
recalibrated predictions against raw model predictions. Notice how, for LV, the 
prequential likelihood seems to be suggesting that the joined-up recalibrated predictions 
are worse than the raw ones. This is a dramatic example of the effect of the 
discontinuity of joined-up recalibrated probability densities upon the likelihood: it 
causes a spurious rejection of these recalibrated predictions in favour of those from the 
raw model. That this is, indeed, spurious can be seen from the behaviour of the spline 
recalibrated predictions: there is overwhelming evidence that the LV**:LV prequential
likelihood ratio is increasing rapidly (it has reached more than e^40 during these
predictions!). A user could therefore be very confident that the LV** predictions here 
are more accurate than the LV ones. 
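
The evolution of a prequential likelihood ratio of this kind can be tracked with a short calculation. The sketch below assumes that, for each prediction system, the predictive density has already been evaluated at each later-observed inter-failure time; the function and argument names are hypothetical, and logarithms are used only to avoid overflow for ratios as large as those quoted here.

```python
import math


def log_prequential_likelihood_ratio(densities_a, densities_b):
    """Running log PLR of prediction system A against system B.

    densities_a[k] and densities_b[k] are the two predictive densities
    evaluated at the kth observed inter-failure time (all must be positive).
    """
    if len(densities_a) != len(densities_b):
        raise ValueError("both systems must predict the same observations")
    running = []
    total = 0.0
    for f_a, f_b in zip(densities_a, densities_b):
        # Each new observation multiplies the ratio by f_a/f_b,
        # i.e. adds log(f_a) - log(f_b) on the log scale.
        total += math.log(f_a) - math.log(f_b)
        running.append(total)
    return running
```

A running value that climbs steadily, such as a log ratio above 40, corresponds to the kind of evidence in favour of system A described above.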

A comparison of JM** and JM is even more dramatic: the PLR reaches e^90 over the
range of predictions shown. This is partly due to the fact that raw JM predictions are 
significantly less accurate than those of raw LV (although both are bad from u-plot 
evidence). Thus JM starts off with more room for improvement. In fact, after 
recalibration, the two spline predictors LV** and JM** have comparable accuracy on 
the prequential likelihood evidence, with slight evidence of superiority for JM**. 

Figure 11 shows an example of recalibrated probability density functions at stage 278 in
the SS3 data set. The two raw predictive densities from LV and JM disagree greatly,
but after recalibration there is close agreement between LV** and JM**. This is
illustrated even more dramatically in Figure 12 which shows predictive densities for
stage 121 in the S1 data. Notice here the curious mode which appears in each
predictive density after recalibration. Neither of the raw predictive densities 
(exponential for JM, Pareto for LV) can have a non-zero mode, which suggests that the 
'learning' from past errors can give an insight not present in the raw models. What is 
particularly striking, we believe, in figures like this is not only the close agreement of 
the two predictions after recalibration, but how dramatically these differ from the raw 
predictions. 

These figures give some indication of the power of the method to change fundamentally 
the raw prediction, on the evidence of analysis of past predictive error. Thus the 
improvements in simple summary statistics shown in the median plots (Figures 3, 7, 9) 
are merely the tip of an iceberg: when recalibration works it will do so in very general 
ways and a user could reasonably expect all reliability measures to improve in accuracy. 


6 Retrodictive recalibration 


The recalibration technique described in this paper is based on an analysis of the 
accuracy of similar predictions at earlier stages in the acquisition of data from testing a 
program. Thus when we came to recalibrate the prediction of T_101 it was necessary to
make predictions of T_20, T_21, ..., T_101 (each based only on the data observed prior to
making the prediction) in order to calculate the G_i* (or G_i**), i = 40, ..., 101, which
transforms the raw prediction, \hat{F}_i(t). For all the models each such prediction is quite
computationally intensive, so a single recalibration can require considerable effort. If 
recalibration is to take place at each stage as each new inter-failure time is observed, 
then of course this overhead disappears, since it will be necessary to calculate each raw 
prediction anyway. 

However, the problem seemed sufficiently important that we examined a retrodictive 
recalibration procedure which only needs a single basic calculation (e.g. maximisation 
of a likelihood function) for each recalibration. For those models using maximum 
likelihood estimation of the parameters this scheme works as follows. To predict T_101,
we use all available data, t_1, ..., t_100, to calculate an estimate of the model parameters.
This is used, of course, to obtain the raw prediction of T_101. It is also used to retrodict
(i.e. "predict" the past) T_1, T_2, ..., T_100. Since we have the actual observations of this
past, we can compare the retrodictions with these in the same way that we do with
genuine predictions. In particular we can form the retrodictive u-plot and use this to
recalibrate the raw prediction of T_101.

Unfortunately, this procedure seems to be useless! The reason is fairly subtle. It 
seems to be the case that a prediction of T_i, based on t_1, ..., t_{i-1}, can be in error in
different ways from a retrodiction of T_j (j<i) also based on t_1, ..., t_{i-1}. More precisely,
the approximate stationarity in the errors of prediction of T_i (based on t_1, ..., t_{i-1}) as we
vary i is very different from the approximate stationarity of errors of prediction of T_j
(based on t_1, ..., t_i) as we vary j for fixed i. It seems that we can expect to obtain the
first kind of approximate stationarity, but not the second: it is, of course, such 
approximate stationarity which underpins the basic idea of recalibration. 

Once again this seems to suggest that in assessing software reliability we must be 
careful of making unfounded generalisations. Just as we cannot assume that a model 
performing accurately on one data set necessarily will give good performance on 
another [1], so we cannot assume that information gained from an analysis of the 
accuracy of one type of prediction will necessarily be trustworthy for another. 
Although these remarks are based on the evidence of retrodictive error being a poor 
guide to one-step-ahead prediction, it is likely that the implications are more far 
reaching. For example, the predictive recalibration method for one-step-ahead 
predictions may not be effective for predictions further ahead. Thus if we wished to 
recalibrate a raw 20-step-ahead prediction it may be necessary to use a form of the G
function which is itself based on a comparison of 20-step-ahead raw predictions with
actual (later observed) data. We hope to investigate issues of this kind in future work. 


7 Discussion and conclusion 


We have shown that recalibration can be a powerful technique for improving the 
accuracy of software reliability growth predictions. The technique is completely 
general, and in particular is not model-dependent: it can be applied to any predictive 
scheme. It can also be used for different types of prediction, but it should be 
remembered that recalibration should be based on past predictions of the same type. 

Our simulation results for the simple joined-up G* suggest that it offers an improvement 
in accuracy over the original models in a high proportion of cases. This alone would be 
sufficient reason for advocating that it be applied as a matter of course to all models:
essentially doubling the number of prediction systems available to the user. 

As we have demonstrated elsewhere [1], a user cannot select a model a priori from this 
plethora of available models and know that it is the best for the job. Instead, it is 
necessary to apply all available models to each data source and use the techniques 
described in [1], principally the prequential likelihood, to select the one which is giving 
most accurate reliability predictions for the particular data source (program) under 
study. 

To make this method of discriminating between reliability prediction systems work for 
recalibrated models, we have introduced the notion of a spline-smoothed recalibrated 
prediction. The user is now in a position to apply several models, and their recalibrated 
versions, to his/her data and select that which is objectively performing best. We 
believe that this eclectic approach should in future be standard practice. 

Our results give a new insight into reliability growth modelling. It can now be seen as 
essentially a two stage process: first capturing the long term trend and then using these 
new ideas to estimate local behaviour. A rich class of new models could be formed 
from a distribution-free fitting of trend, followed by a later analysis of detailed 
probabilistic structure along the lines described above. We are currently investigating 
these possibilities: early results are encouraging. 

Data set
(no. predictions)         JM      BJM     GO      MO      DU      L       LNHPP   LV      KL

S1 (86)            u      .2049E  .1871E  .1773E  .0982A  .1567D  .1123A  .0982A  .1504D  .1457D
                   u*     .1188B  .1226B  .1341C  .0499A  .0752A  .0499A  .0499A  .0894A  .0901A
                   y      .1156B  .1148B  .1190B  .0795A  .1029A  .0904A  .0793A  .1148B  .1173B
                   y*     .1018A  .1016A  .1076A  .0775A  .0808A  .0893A  .0768A  .0901A  .0916A

SS3 (173)          u      .2717E  .2713E  .2705E  .2645E  .2596E  .2717E  .2704E  .2382E  .2372E
                   u*     .0982C  .1042D  .0978C  .1057D  .1122D  .0987C  .0997C  .0864B  .1043D
                   y      .1273E  .1379E  .1263E  .1435E  .1835E  .1291E  .1300E  .0346A  .0500A
                   y*     .0577A  .0664A  .0579A  .0631A  .0968C  .0561A  .0558A  .0415A  .0596A


Table 1 Kolmogorov distances for u- and y-plots for raw model and for joined- 
up recalibrated predictions. The letters indicate significance levels: E is 
significant at the 1% level, D at 5%, C at 10%, B at 20%, A is not 
significant at 20%. Roughly: A and B are very good, C is acceptable, 
D and E are poor. 


Data set
(no. predictions)         JM      BJM     GO      MO      DU      L       LNHPP   LV      KL

S1 (86)            u      .2049E  .1871E  .1773E  .0982A  .1567D  .1123A  .0982A  .1504D  .1457D
                   u**    .1168B  .1197B  .1277B  .0511A  .0794A  .0507A  .0526A  .1027A  .1053A
                   y      .1156B  .1148B  .1190B  .0795A  .1029A  .0904A  .0793A  .1148B  .1173B
                   y**    .1109A  .1126A  .1102A  .0852A  .0762A  .0715A  .0853A  .0878A  .0916A

SS3 (173)          u      .2717E  .2713E  .2705E  .2645E  .2596E  .2717E  .2704E  .2382E  .2372E
                   u**    .0820B  .0822B  .0782A  .0901B  .0916B  .0859B  .0846B  .0834B  .1006C
                   y      .1273E  .1379E  .1263E  .1435E  .1835E  .1291E  .1300E  .0346A  .0500A
                   y**    .0573A  .0693A  .0560A  .0632A  .1016C  .0571A  .0557A  .0352A  .0452A


Table 2 As Table 1 but for spline-smoothed recalibrated predictions. 

Figure 2 Method of drawing the joined-up step recalibrating function, G_i*. Here
there are r u-points and each step is of size 1/(r+1).

Figure 3 Predictive medians of T_51 through T_136, raw and recalibrated using
joined-up recalibrator, G_i*, for Musa System 1 data [17].